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Abstract 

We propose localized functional principal component analysis (LFPCA), looking 
for orthogonal basis functions with localized support regions that explain most of the 
variability of a random process. The LFPCA is formulated as a convex optimization 
problem through a novel Deflated Fantope Localization method and is implemented 
through an efficient algorithm to obtain the global optimum. We prove that the pro¬ 
posed LFPCA converges to the original FPCA when the tuning parameters are chosen 
appropriately. Simulation shows that the proposed LFPCA with tuning parameters 
chosen by cross validation can almost perfectly recover the true eigenfunctions and sig¬ 
nificantly improve the estimation accuracy when the eigenfunctions are truly supported 
on some subdomains. In the scenario that the original eigenfunctions are not localized, 
the proposed LFPCA also serves as a nice tool in finding orthogonal basis functions 
that balance between interpretability and the capability of explaining variability of 
the data. The analyses of a country mortality data and a growth curve data reveal 
interesting features that cannot be found by standard FPCA methods. 

Keywords: functional principal component analysis, domain selection, interpretability, 
orthogonality, deflation, convex optimization. 


1 Introduction 

Functional principal component analysis has emerged as a major tool to explore the source 
of variability in a sample of random curves and has found wide applications in functional 
regression, curve classification, and clustering (Castro et al., 1986; Rice & Silverman, 1991; 
Cardot, 2000; Yao et al., 2005; Ramsay &; Silverman, 2005; Hall & Hosseini-Nasab, 2006). 
In this paper, we consider functional principal component analysis with localized support 
regions. That is, for a smooth random function X, we look for orthogonal basis functions 
with localized support regions that explain most of the variance. The main motivation of 
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the localized functional principal component analysis (LFPCA) is to find a parsimonious 
linear representation of the data that balances the interpretability and the capability of 
explaining variance of the stochastic process. 

The proposed method outputs localized basis functions whose localization level is controlled 
by a localization tuning parameter. We propose two methods to select the localization 
parameter, corresponding to two useful applications of our proposed method. First, one 
can choose the localization parameter by maximizing the explained variance of the random 
process computed by V-fold cross-validation. Our simulation shows that when the eigen¬ 
functions truly have localized support regions, the proposed LFPCA with a localization 
parameter chosen by cross-validation significantly improves the estimation accuracy of the 
eigenfunctions compared to standard FPCA methods. On the other hand, when the original 
eigenfunctions are not localized, the localization parameter chosen by cross-validation is 
expected to be very close to zero and it is confirmed by our numerical studies that the 
performance of the proposed LFPCA is almost identical to standard FPCA methods. The 
second method of choosing the localization parameter is to seek the most localized basis 
functions that explain a fixed level of variance. This method is particularly useful when the 
standard eigenfunctions are not localized, and it makes sense for the proposed LFPCA not 
to target at the standard eigenfunctions but to balance between interpretability and the 
capability of explaining variance of the stochastic process. Details can be found in Section 
3.2. We consider two data applications to illustrate the second method of choosing the 
localization parameter. One is a country mortality data, where the mortality rates at age 
60 were recorded from year 1960-2006 for 27 countries around the world. The first three 
localized basis functions, explaining more than 85% variance, correspond to variational 
modes around mid 1990s, 1980s, and 1960s, respectively. Another example is a growth 
curve data, where the height of 54 girls were densely recorded from year 1 to year 18. The 
first two localized basis functions explain more than 85% variance, and clearly indicate that 
the main variational modes of female height growth are around age 12 and around age 5, 
perfectly matching the knowledge of growth spurts. These interesting features cannot be 
revealed by standard FPCA. 

Domain localization has been studied by several authors in the functional regression model: 
E(Y\X ) = a + f T X(t)/3(t)dt, where the coefficient function /3(f) is desired to be zero 
outside a subdomain 7o £ T with the purpose of improved interpretability (James et al., 
2009; Zhao et ah, 2012; Zhou et ah, 2013). Most of these methods turn the problem 
into a variable selection problem and use LASSO type penalties. In a recent thesis work 
Lin (2013), interpretable functional principal component analysis is studied, which has 
a very similar flavor as our proposed LFPCA. In their work, an penalty is added on 
eigenfunctions and a greedy algorithm based on basis expansion of curves is developed to 
solve the non-convex optimization problem. The formulation of localization or domain 
selection in the context of functional principal component analysis is quite challenging for 
at least two reasons. First, the eigen problem together with a localization penalty is usually 
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not convex, and in general it is an NP-hard problem to find a global optimum. Second, in 
order to obtain a sequence of mutually orthogonal eigen-components, a commonly taken 
procedure is to deflate the empirical covariance operator at step j by removing the effect 
of the previous j — 1 components (White, 1958; Mackey, 2008). But with the localization 
penalty in the objective function, this procedure can not guarantee the orthogonality of such 
sequentially obtained eigen-components. In sequential estimation of principal components, 
being orthogonal to the first component is a natural requirement when looking for the 
second component, otherwise the maximization over second direction is not well-defined 
since the solution would still be the first direction. From a dimension reduction perspective, 
the orthogonality is also appealing since the resulting k dimensional orthogonal basis leads 
to very simple calculation for subsequent inferences. 

The main contribution of this paper is three-fold. First, we formulate the LFPCA as a convex 
optimization problem with explicit constraint on the orthogonality of eigen-components. 
Second, we provide an efficient algorithm to obtain the global maximum of this convex 
problem. Third, we carefully investigate the estimation error from the discretized data 
version to the functional continuous version, as well as the complex interaction between 
the eigen problem and the localization penalty, and prove consistency of the estimated 
eigenfunctions. The starting point of our method is a sup-norm consistent estimator of the 
covariance operator, up to a constant shift on the diagonal. For dense and equally spaced 
observations with or without measurement error, the proposed method can be directly 
carried out on the sample covariance, i.e., without the need to perform basis expansion, 
smoothing of the individual curves, or smoothing of the estimated covariance operator. For 
other designs of functional data, the proposed method is still applicable when an appropriate 
covariance estimator is available. 

Our formulation of LFPCA borrows ideas from recent developments in sparse principal 
component analysis. In Vu et al. (2013); Lei & Vu (2015), a similar convex framework based 
on Fantope Projection and Selection has been proposed to estimate a k dimensional sparse 
principal subspace of a high dimensional random vector (see also d’Aspremont et al. (2007) 
for k = 1). These sparse subspace methods are useful when the union of the support regions 
of several leading eigenvectors is sparse. In sparse PCA settings, the notion of sparsity 
requires the proportion of non-zero entries in the leading eigenvectors to vanish as the 
dimensionality increases and therefore it makes sense to consider the union of the support 
regions of several leading eigenvectors to be sparse. However, in functional data settings, the 
length ratio of a support subdomain over the entire domain is determined by the random 
curve model and usually a constant, and the union of several leading subdomains can be as 
large as the entire domain. This is also the reason that we use the notion “localized” instead 
of “sparse”. It has remained challenging to obtain sparse eigenvectors sequentially that each 
one is allowed to have a different support region. A particular challenge is the interaction 
between orthogonality and the sparse penalty. Besides the difference between functional 
PCA and sparse PCA, one main extension developed in our method is the construction 
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of a deflated Fantope to estimate individual eigen-components sequentially, with possibly 
different support regions and guaranteed orthogonality. This deflated Fantope formulation 
is of independent interest in many other structured principal component analysis. 

The rest of this paper is organized as follows. In section 2, we introduce the formulation 
of localized functional principal component analysis. Section 3 derives the solution to 
the optimization problem and describes the algorithm as well as the selection of tuning 
parameters. Section 4 contains the consistency results. Section 5 and Section 6 present 
numerical experiments and data examples to illustrate our method. Section 7 contains some 
discussions and extensions. Technical details are included in the Appendix. 


2 LFPCA Through Deflated Fantope Localization 

We consider a square integrable random process X(t) : T H > M over a compact interval T C 
M, with mean and covariance functions fi(t) = EX(t) and T(s,t) = Co v(X(s), X(t)), and 
covariance operator (Tf)(t) = f sG j- s)ds. Under the minimal assumption that T(s, t) 

is continuous over ( s,t ), this operator T has orthonormal eigenfunctions (j)j(t),j = 1, 2, • • ■ , 
with nonincreasing eigenvalues A j, satisfying Tcj)j = AThe well known Karhunen-Loeve 
expansion then gives the representation 

OO 

x(t) = n(t) + '22t j <i> j (t), ( 1 ) 

3 =i 

where £j, j > 1, is a sequence of uncorrelated random variables satisfying E(£j) = 0 
and var(^j) = A j, with an explicit representation = f te7 -(X(t) — n(t))(/)j(t)dt. A key 
inference task in functional principal component analysis (FPCA) is to estimate the leading 
eigenfunctions q 7, (7), 1 < j < k, from n independent sample curves. 

In practice, the underlying sample curves Xi(t), 1 < i < n, are usually recorded at a grid 
of points and may be contaminated with additive measurement errors. We start with dense 
and equally spaced observations 

OO 

Yu = Xj(tj) + eg = n(ti) + ^ + eg , i = l,...,n, l = l,...,p, (2) 

j= i 

where eg are independent noises with mean zero and variance a 2 , and ti, 1 < l < p, are grid 
points in T at which observations are recorded. Starting from such discrete and possibly 
noisy data, there are different ways of introducing smoothness in the estimation of FPCA. 
Rice & Silverman (1991); Silverman (1996); Huang et al. (2008), among others, studied 
approaches where a smoothness penalty on eigenfunctions is integrated in the optimization 
step of eigen-decomposition. Let S be the p x p sample covariance matrix of the observed 
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vector Y. and v be a p dimensional vector. Rice & Silverman (1991) used a roughening 
matrix D = A T A where A £ M^ _2 ) x p is a second-differencing operator: 

( 1, if j £ {i,i + 2} , 

Ajj = < -2, if j = i + 1, 

[ 0, otherwise. 

A smoothed eigenfunction estimator is obtained by solving the following eigen problem: 

maxv T (S — p\D)v, s.t. ||u|| 2 = 1, (3) 

where p\ is a smoothing parameter, and ||u|| 2 is the Euclidean norm of v. 

A straightforward approach to localize the estimated eigenfunctions is to add another 
localization penalty: 


maxv T (S — p\D)v — p 2 \\v\\ 1 , s.t. ||u|| 2 = 1, (4) 

where p 2 is a tuning parameter, and ||u|| 1 is the t\ norm of v. However, this is not a convex 
problem and there are no known algorithms that can efficiently find a global optimum even 
for the first eigen-component. 

Here we propose a novel sequential estimation procedure based on the idea of estimating the 
rank one projection matrix vv T . Let (A, B) = trac e(A T B) for matrices A, B of compatible 
dimensions. Denote ||Lf|| 11 the matrix l\ norm, which is the sum absolute value of all 
entries in H. Starting from Ho = 0 , for each j = 1,..., k, the jth localized eigen-component 
is estimated as follows. 

Hj = argmax(S — p\D,H) — p 2 \\H\\ ll , s.t. H £ T>~ , 

Vj = the first eigenvector of Hj, (5) 

ft? = fij-i + VjVj , 

where, for any p x p projection matrix n, 

V n := {H : 0 A H < I, trac e(H) = 1 , and (H , n) = 0 }, 

and, for symmetric matrices A, B, “ A A B v means that B — A is positive semidefinite. 

Problem (5) is a convex relaxation of (4) with integrated orthogonality constraints on the 
estimated localized eigen-components. With the estimated Vj, we can easily obtain an 
estimate <pj{t) of the localized eigenfunction <f>j(t) by standard interpolation techniques 
such as linear interpolation, plus an optional final step of re-orthogonalization and re¬ 
normalization. 
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With appropriately chosen tuning parameters, the performance of the proposed method ties 
to the maximum entry-wise error of the discretized covariance estimator S (see Section 4 
for details). Our presentation will focus on a sample covariance S that is computed from 
dense and equally spaced observations, but the proposed method is not restricted to a 
dense regular design as long as a reasonable covariance estimate can be obtained. More 
discussions can be found in Section 7. 

To solve for iij and cf>j(t), the key step in (5) is to solve 

max(S — piD, H) — p 2 \\H\\ 11 , s.t. H E V n , (6) 

where II = IIj_i at step j. In next section we present an algorithm that solves problem (6), 
with a discussion on the choice of tuning parameters p\ and p 2 - 

In the sparse PCA literature, d’Aspremont et al. (2007); Vu et al. (2013) have considered 
the following problem, 

max(S,H)-p\\H\\ 1:1 , s.t. H (7) 

where the convex set J~ d := {H : 0 A H < /, trace(id) = d} is called the Fantope of 
degree d (Dattorro, 2005), and is the convex hull of all rank d projection matrices. While 
the convex relaxation given in (7) allows us to estimate d-dimensional sparse principal 
subspaces, it does not lead to a sequence of mutually orthogonal eigenvectors with different 
support regions. 

To ensure orthogonality among the estimated eigenvectors, we consider T> u := {H : H E 
F 1 , and (H, II) = 0}, which we call the deflated Fantope. It can be naturally generalized 
to T>yi '■= {H : H E F d , and (id, II) = 0} to estimate mutually orthogonal principal 
subspaces. Such a feasibility deflation technique is quite different from the commonly 
suggested matrix deflation techniques in sequential estimation of eigenvectors (see Mackey 
(2008) for example). 


3 Algorithm 

3.1 Deflated Fantope Localization using ADMM 


The main difficulty in solving problem (6) is the complex interaction between the t\ penalty 
and the deflated Fantope constraint. To overcome this difficulty, we write (6) in an equivalent 
form to separate the i\ penalty and deflated Fantope constraint: 


min Iv n (H) - (S - piD, H) + p 2 \\Z \\ 11 , 

Jtl , Zj 

s.t. H — Z = 0 , 


( 8 ) 
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Algorithm 1 Deflated Fantope Localization using ADMM 

Require: S = S T , II, D, pi, p2 > 0, r > 0, e > 0 


z(°) «- 0, W (°) «- 0 

> Initialization 

repeat r = 1,2,... 


ff (r) <- Vv n [Z( r -^ - VF^- 1 ) + (S - P!D)/t\ 

> Deflated Fantope projection 

Z W <- S P 2 /t (H to + W( r -V) 

> Elementwise soft thresholding 

Vp( r ) + H M 

> Dual variable update 

until || RW - Z^f F Vr 2 ||ZW - Z^ r ~^\\ 2 F < e 2 

> Stopping criterion 

return Z^A 



where \x> is the convex indicator function, which is oo outside T> u and 0 inside V u . Problem 
(8) is a convex global variable consensus optimization, which can be solved using alternating 
direction method of multipliers (ADMM, Boyd et al. (2011)). We describe in Algorithm 1 
an ADMM algorithm that solves (8) and hence (6). It extends the FPS algorithm in Vu 
et al. (2013) to the deflated Fantope. 

The two matrix operators used in the algorithm are defined as follows. 

(i) Soft-thresholding operator: for any a > 0, 

S a (x) = sign(x) max(|a;| — a, 0). 


(ii) Deflated-Fantope-projection operator: For any pxp symmetric matrix A and projection 
matrix II, 

Vv n {A) := arg min ||A - B\\% 

BeT> n 

is the Frobenius norm projection of A onto the deflated Fantope D n . 

A non-trivial subroutine in Algorithm 1 is to calculate the deflated-Fantope-projection 
Vx> u {A) for a symmetric matrix A. The following lemma gives a close-form characterization 
of the deflated-Fantope-projection operator. 

Lemma 1. Let II = VV T , where V is a p x d matrix with orthonormal columns. Let U 
be a p x (p — d) matrix that forms an orthogonal complement basis of V. Then 


VvJA) = U 


'p—d 


'(0)rkVi 


_i =1 


u 1 


where (7i, r li)i=f are eigenvalue-eigenvector pairs of U T AU: U T AU = l'i r h r lT ; an d 

7 ’f{6) = min(max( 7 j — 0, 0), 1), with 0 chosen such that Yli=] it W = 1- 


7 







The next theorem ensures the convergence of our algorithm to a global optimum of problem 
(6). The proofs of Lemma 1 and Theorem 2 are deferred to the Appendix. 

Theorem 2. In Algorithm 1, Z M -» H*, H M —> H* as r -> oo, where H* is a global 
optimum of problem (6). 

In the proof of Theorem 2 we will see that the auxiliary number r used in Algorithm 1 
plays a role that is similar to the step size commonly seen in iterative convex optimization 
solvers. The particular choice of r does not affect the theoretical convergence of the ADMM 
algorithm. There are some general guidelines on the practical choice of r and we refer to 
Boyd et al. (2011) for further details. 

3.2 Choice of Tuning Parameters 

The optimization problem (5) involves two tuning parameters: p\ controls the roughness 
of eigenfunctions; and p2 controls the localization of the eigenfunctions. We present a 
two-step approach where p\ is chosen first and kept the same for all 1 < j < k, and then 
P2 is determined sequentially for each eigenfunction <pj(t) and denoted by p 2 ,j- The two 
parameters can be chosen together by straightforward modification but computationally it 
would be a bit intensive. 

The choice of the smoothing parameter p\ has been discussed in Rice Sz Silverman (1991), 
and they recommended cross-validation or manual selection. In our simulation and data 
analysis, we have used R-fold cross-validation. First the data is divided into V folds, 
denoted by Pi, V2, ■ • •, Vy. Let H'- v \pi, P2) be the estimated Hj in (5) using data other 
than V v with tuning parameters p\ and p2- Let S v be the discrete covariance estimated 
from data V v . The smoothing parameter is chosen by maximizing the cross-validated inner 
product of h\ ^ and S v : 

v 

pi =argmaxy 2 (H^ L ’\p, 0 ),S v ), (9) 

P £A 1 V=1 

where A\ is a candidate set of p\ and empirically we found that a sequence between 0 and 
p times the largest eigenvalue of S works well. 

In the following, we present two methods for the choice of P2 given a pre-chosen smoothing 
parameter p\. The first method is to choose p2j by maximizing the cross-validated inner 
product of Hj v ' > and S v : 

V 

P2,j = arg max^^(Hj~ v \pi, p), S v ) , j = 1,2 

P&A2J ^ 


( 10 ) 


where A- 2 ,j is a candidate set for p 2 j, and we propose to use a sequence between 0 and the 
95% quantile of absolute values of off-diagonal entries in Sj, with Sj = (/ — Ilj-i)S(I — 

The %-fold cross-validation approach is expected to give a p 2 that indicates the true 
localization level of the eigenfunctions. The criterion in our proposed cross-validation 
corresponds to maximizing (T,, H(p 2 )), where £ is the discretized population covariance 
and is substituted by the test sample covariance in practice. When the true eigenvector 
v is localized, (£, H(p 2 )) shall be maximized at approximately the value of p\ which 
corresponds to the ideal localization level of the eigenfunction, i.e., ||#(/? 2 )|li i ~ ||uu T ||i 1 . 
We shall expect (£,iL(p 2 )) as a function of p 2 to be (i) monotonically increasing on [0 , p^\ 
as the search area gradually expands to cover the true eigenvector, and (ii) monotonically 
decreasing on [/> 2 ,oo) as the search area goes unnecessarily larger so that the estimation 
becomes more noisy. When the true eigenvector v is not localized, then we shall expect 
(£,i/(/? 2 )) to be monotonically decreasing as p 2 increases. The numerical study confirms 
the good performance of the cross-validation method. See Section 5 for more details. 

In some applications, we may not want to target the standard eigenfunction, but instead we 
may want to find orthogonal linear expansions that balance the interpretability (localization) 
and the capability of explaining the variance of the process. We therefore propose a second 
method of choosing p 2 , which is based on the notion of fraction of variance explained (FVE). 
For a p-dimensional vector v of unit length and a sup-nornr consistent estimator S of the 
covariance operator, 

FVE{v) = v T Sv/totV{S), (11) 

where totV ( S ) is the sum of positive eigenvalues of S. We note that for dense and equally 
spaced observations with measurement error, the FVE(y ) defined above is not directly 
applicable to a sample covariance S because the sample covariance is sup-norm consistent 
up to a shift o 2 on the diagonal, where <r 2 is the error variance. To avoid serious bias 
by the nugget effect on the diagonal, one may use the eigenvalues from the smoothed 
covariance. Another practical way is to approximate totV{S ) by the sum of the first M 
leading eigenvalues of S, for a finite number M. For a reasonable error level, the nugget 
effect bias Mo 2 is small compared to the sum of the first M eigenvalues of S, which is of 
order p (Kneip et al., 2011), while the remaining true eigenvalues beyond M are usually 
very small because the smoothness of X(t) ensures fast decay of eigenvalues. In numerical 
experiments where FVE is needed for determining the number of principal components to 
be included, we use M = min(20,p — 2). 

To sequentially select the sparsity parameter p 2 for the jth eigenfunction. Suppose that we 
have estimated Vi for 1 < i < j — 1. Let vj{p\, p) be the solution of (5) by using a fixed p\ 
and p 2 = p , with IIj_i being the projector of the subspace spanned by (vi : 1 < i < j — 1). 
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We can define 


rFVE(p) = 


FVE(vj(pl,p )) 
FVE(vj(p$, 0 )) 


vj( pI, p)Sv j (p* 11 p) 
vJ(p* 11 0)Sv j (p* 1 ,0) 1 


( 12 ) 


and choose p r 2 .j as 


max{p E A2,j ■ rFVE(p) > 1 — a}, 


(13) 


where a E [0,1) is the proportion of FVE that one chooses to sacrifice in return of localization. 
For any a E [0,1), a p satisfying (13) always exists, because rFVE(p) = 1 for p = 0 and 
rFVE(p) E [0,1) for p > 0. Equation (12) also suggests that rFVE(p) can be calculated 
without computing totV ( S ). Although the first localized basis function explains less variance 
than the standard eigenfunction, the lost proportion is likely to be picked up by the second 
component, and we are still able to explain a large proportion of the total variance with a 
small number of components. We illustrate this method with real data analyses in Section 
6 . 


4 Asymptotic Properties 

In this section we establish the 1 2 consistency of the proposed estimator in an asymptotic 
setting where both the sample size n and the number of grid points p increase. We will 
first provide sufficient conditions on the tuning parameters p± and P 2 such that the LFPCA 
estimate is consistent. Our second result provides further insights on how the localization 
penalty p 2 affects the rate of convergence. We make the following assumptions. 

Al. The input matrix S in (5) satisfies sup-norm consistency up to a constant shift on the 
diagonal: for some constant a > 0 and a sequence e n = o(l), 

max \S(l, l') — r(f/, tp) — al(l = l')\ = Op(e n ), as (n,p) 00 . 

Remark: Assumption (Al) puts a mild condition on the input matrix S that can be 
satisfied by many standard estimators. Consider functional data with dense and equally 
spaced observations. If S is the sample covariance estimator from the raw data, standard 
large deviation bounds such as Bernstein’s inequality (Van Der Vaart & Wellner (1996), 
Chapter II) imply that Assumption (Al) holds with e n = y/logp/n if log p/n —> 0 and the 
random curve X ( t ) as well as the observation error in model (2) has sub-Gaussian tails; see 
also Kneip et al. (2011). In this case a = cr 2 , the noise variance. If a smoothed covariance 
estimator S is used, the sup-norm rate can be ydogre/n (Li et al., 2010). The convergence 
in other norms such as Frobenius norm can be found in (Hall & Hosseini-Nasab, 2006; Hall 
et al., 2006; Bunea & Xiao, 2015). In general, the consistency result does not really depend 
on the observational design as long as we can get an estimate of the covariance operator 
whose sup-norm error vanishes as n and p increase. More discussions about cases where a 
sample covariance is not feasible can be found in Section 7. 
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A2. There is a positive integer k such that the eigenvalues of T satisfies Ai A& > 

Afc+i >,...,> 0, with positive eigen-gap 6 := mini<j<fc(Aj — Aj+i) > 0. 

A3. T is Lipschitz continuous: 

|T(s, t ) — r(s', t')\ < Lmax(|s — s'\, |i — t'\) , V s, s' , t, t'. 

A4. The k leading eigenfunctions of T have Lipschitz first derivatives: 

I <t>j(t) - 4>'j{.s) | <L\t-s\, VI <j<k. 

Theorem 3 (£2 consistency). Under assumptions (A1-A4), if p±/p 5 —> 0 and P 2 —3 > 0 as 
( n,p ) —> 00 , then for k as defined in A2, we have 

sup || <f>j{t) - 4>j{t )|| 2 4 0. 

i<j<fc 


The proof of Theorem 3 is deferred to the Appendix. Here we outline the proof, highlighting 
some key technical challenges. 

Let Uj be the unit vector obtained by discretizing and re-normalizing the eigenfunction 
1 i>j(t). We will prove Theorem 3 by proving 

II- II P n 

sup ||Vj — Uj || 2 —> U. 

1 <j<k 


Let S be the discretized covariance operator, with a possible constant shift a on the diagonal. 
Roughly speaking, Vj and Uj approximate the jtli eigenvector of S and S, respectively. We 
hope to establish the following inequality based on the standard Davis-Kahan sin 0 theorem 
(Bhatia (1997), Theorem VII.3.1), 

\\vjvf -ujuJWp < 4||S-E|| f < ^||5 — 5H|| 0OjOO , (14) 

provided that S and S have eigengap of order 5p (Lemma 6), where ||A|| F = (A, A) 1 ^ 2 is 
the Frobenius norm and ||A|| is the maximum absolute value of all entries in A. 

However, to rigorously obtain an approximated version of (14) is non-trivial. First, Uj is 
not an eigenvector of E because of the discretization error. The discretization error will 
be explicitly tracked in all subsequent analysis when comparing Uj with Vj, for example, 
in the characterization of population PCA problem (Lemma 8). Second, vj is obtained 
by solving a penalized eigenvector problem over the deflated Fantope, and hence is not 
directly comparable to its ideal theoretical counterpart Uj, which may not be in the feasible 
set of problem (5). To overcome this difficulty, we will consider a modified version of 
Uj that is feasible for (5) but still possesses similar smoothness as well as proximity to 
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the true eigenvector of E. Furthermore, the sequential estimation procedure (5) involves 

deflation based on estimated projection matrix IIj_i, which carries over estimation error 

from previous steps. We will use an induction argument to control the sequential error 
accumulation. 

Due to the sequential error accumulation, the convergence rate involves p\ and P 2 in a 
complex way. To provide insights for our localized estimation procedure, the following 
theorem shows how the rate of convergence depends on p 2 when the smoothing parameter 
pi = 0. The proof is included in the Appendix. 

Theorem 4 (Rate of convergence). Under assumptions (A1-A4), if pi = 0 and p 2 — i ► 0 as 
( n,p ) — > oo, then for k as defined in A2, we have 

sup \\<t>j(t) - || 2 = 0 P (e n + p 2 +p~ 1 ) • 

l<7<fc 

The three parts in the rate of convergence correspond to covariance estimation error, bias 
caused by localization penalty, and discretization error, respectively. According to the 
discussion after Assumption Al, the sup-norm covariance estimation error e n can be made as 
small as ydog p/n or yj logn/n depending on the estimating method used and observation 
scheme. Thus if p grows at the same or higher order than y/n, and p 2 = 0(e n ), our LFPCA 
estimate achieves an error rate of e n within a logarithm factor from the standard FPCA 
error rate. 


5 Numerical Study 

To illustrate our methods for localized functional principal component analysis, we conduct 
simulations under two scenarios, Simulation I: localized case and Simulations II: non-localized 
case. For Simulation I, data {Yu,i = 1,..., n, l = 1,... ,p} are generated according to model 
(2), where ti are equally spaced observational points on [0,1]. We set p(t) = 0, ^ ~ JV(0, Ay), 
independent, with A j taken from (4 2 , 3 2 , 2.5 2 , 1.25 2 , 1, 0.75 2 , 0.5 2 , 0.25 2 ) and A j = 0 for 

j > 8, and the measurement errors eu 1V(0, a 2 ). We generate the eigenfunctions as follows. 
Let 4>i(t) = Bs(t), (j> 2 (t) = Be(t ) and (p:i{t) = Bg(t), where is the 6th cubic B-spline 

basis on [0,1], with 8 equally spaced interior knots. For j > 3, <j lj(t) = \/2cos((j + l)vrt) 
for odd values of j and (f>j(t) = \/2sin(j7rt) for even values of j, 0 < t < 1. Then 
0y (i), 1 < j < 8, are obtained by applying Gram-Schmidt orthonormalization on the set of 
1 < j < 8. For Simulation //, we use 4>j(t) = v / 2cos((j + l)7rt) for odd values of j 
and 4>j(t) = \/2 sin(j7rt) for even values of j, 0 < t < 1, for 1 < j < 8. The rest is the same 
as simulation I. 

We investigate the performance of the proposed LFPCA under varying combinations of 
sample size n, number of observations per curve p, and noise level a 2 . More importantly, we 
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(0,0) 


(Pit o) 


NonSeq(p h p 2 ) 


(Phfa) 



Figure 1: True (blue-solid) and estimated (red-dashed) eigenfunctions j = 1,2,3, in 

one run of Simulation /, with n = 100, p = 100 and a = 1, by four different methods as 
described in Section 5. Tuning parameters are chosen by 5-fold cross-validation. 

compare the estimates given by four different methods (i) (pi, P 2 ) = (0,0) corresponds to the 
ordinary PCA estimation directly obtained from the sample covariance; (ii) (pi, p^) = (pi, 0) 
corresponds to the smoothed eigenfunction estimation without localization, where pi was 
chosen by 5-fold cross-validation as discussed in Section 3.2. Empirically, we found these 
estimated eigenfunctions almost identical to those estimated from a smoothed covariance 
function or pre-smoothed individual curves; (iii) NonSeq corresponds to the subspace method 
developed in Vu et al. (2013). For comparison purpose, we incorporate the roughening 
matrix, i.e., use S — p\D as input matrix in (7) with pi chosen by 5-fold cross-validation, 
the same as used in (ii) and (iv). The sparse tuning parameter is chosen by 5-fold cross 
validation p 2 = argmax^g^ p), S v ). We also note that their proposed 

method only outputs k basis vectors of the fc-dimensional subspace, and one needs to 
rotate that basis to obtain eigenvectors, (iv) (pi,p 2 ) = (pi,P 2 ) corresponds to the proposed 
LFPCA with tuning parameters selected by 5-fold cross-validation as detailed in (9) and 
(10). Each setting is repeated 200 times to assess the average performance. The number 
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Figure 2: Noisy observations and recovered functions X-i(t) (red-solid) for nine randomly 
selected subjects, as obtained in one run of simulation I with n = 100, p = 100 and a = 1, 
using (pi,p 2 ) chosen by 5-fold cross-validation. 


of included components k is chosen to account for at least 85% of the total variance, i.e. 
£ U FVE(v : j) > 85%, where FVE is defined in (11). The selected number of k is quite 
robust among all simulation settings, and the average number over 200 simulations is 
3.01. 

The estimated eigenfunctions <ftj(t), j = 1,2,3, from a typical run of Simulation I with 
p = 100, n = 100, and a = 1 are visualized in Figure 1. The four columns correspond to 
results given by the four methods described above. One can clearly see the improvement 
by adding smoothing and localization penalties. The 5-fold cross-validation choice ( pi,p 2 ) 
leads to almost perfect recovery of the true eigenfunctions. The fitted curves Xj(t) = 
fi(t ) + £*=i iij4>j(t ) for nine randomly chosen subjects are shown in Figure 2, where (f)j(t) 
are obtained using (pi, fa). It demonstrates accurate recovery of the true curves Xi(t). 

To better quantify the performance of estimating (j)j{t) we report the i 2 distance \\<frj(t) — 
<f>j{t) || 2 - The medians of the errors over 200 simulation runs are reported in Table 1. The 
results are quite similar for different levels of p and a, so only results for p = 100 and a = 1 
are reported with various sample size n. The errors are found to decline with increasing 
sample size n, as expected. For Simulation I: localized case , the proposed LFPCA with 
p 2 chosen by cross-validation significantly outperforms other methods. The P 2 chosen by 
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Figure 3: Performance of the 5-fold cross-validation to choose p 2 for fa. The top panel is for 
Simulation I: localized case: The peak of the cross-validated inner product (H, S) (dashed 
blue line, left y label) corresponds to fa = 3.8. The estimated fa is more localized as p 2 
increases, and an ideal p 2 is where the || ■ || 11 of estimated H meets that of true discretized 
projection matrix corresponding to fa (indicated by the horizontal line). The bottom panel 
is for Simulation II: non-localized case: fa = 1.2. 


cross-validation well approximates the true localization level of the eigenfunctions; see 
Figure 3 for an illustration and detailed discussions can be found in section 3.2. The NonSeq , 
a subspace method proposed by Vu et al. (2013), does not perform well since the union of 
the subdomains under consideration is the entire domain, not ‘sparse’ at all in their setting. 
The results demonstrate the advantage of our proposed sequential method. For Simulation 
II: non-localized case, 5-fold cross-validation method combined with the proposed LFPCA 
choose faj to be very close to 0, and as expected, the fa errors of the four methods are 
almost identical. 
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Tabic 1: Results for simulation: reporting the median of errors (\\4>j — <A?|| 2 ) f° r 4>ji 3 — 1, 2,3, 
(with median absolute deviations in parentheses) over 200 simulation runs, with a = 1, 
p = 100, and varying sample sizes n, where NonSeq is the subspace method developed in 
Yu et al. (2013). 



n 

Simulation I: Localized 
= 50 n = 100 n = 

= 200 

n 

Simulation II: Non-localized 
= 50 n = 100 n = 200 


(0, 0) 

0.22 

(0.09) 

0.18 

(0.06) 

0.13 

(0.04) 

0.22 

(0.08) 

0.15 

(0.05) 

0.12 

(0.04) 

4>i 

(pi, o) 

0.22 

(0.09) 

0.18 

(0.07) 

0.13 

(0.04) 

0.21 

(0.08) 

0.15 

(0.06) 

0.11 

(0.04) 


NonSeq 

0.22 

(0.09) 

0.18 

(0.07) 

0.13 

(0.04) 

0.21 

(0.08) 

0.15 

(0.06) 

0.11 

(0.03) 


(pi, fa) 

0.12 

(0.05) 

0.10 

(0.03) 

0.06 

(0.02) 

0.22 

(0.08) 

0.15 

(0.05) 

0.13 

(0.03) 


(0, 0) 

0.42 

(0.16) 

0.30 

(0.13) 

0.20 

(0.07) 

0.42 

(0.16) 

0.29 

(0.12) 

0.19 

(0.07) 

4*2 

(fa, o) 

0.41 

(0.16) 

0.30 

(0.13) 

0.20 

(0.07) 

0.41 

(0.15) 

0.28 

(0.11) 

0.19 

(0.06) 


NonSeq 

0.40 

(0.16) 

0.30 

(0.12) 

0.20 

(0.07) 

0.41 

(0.15) 

0.28 

(0.11) 

0.19 

(0.07) 


(pi. fa) 

0.26 

(0.17) 

0.14 

(0.07) 

0.11 

(0.05) 

0.41 

(0.17) 

0.28 

(0.11) 

0.20 

(0.07) 


(0, 0) 

0.37 

(0.12) 

0.27 

(0.11) 

0.18 

(0.07) 

0.31 

(0.10) 

0.26 

(0.09) 

0.18 

(0.06) 

4> 3 

(pi, 0) 

0.36 

(0.12) 

0.27 

(0.11) 

0.18 

(0.08) 

0.30 

(0.10) 

0.26 

(0.09) 

0.17 

(0.06) 


NonSeq 

0.33 

(0.14) 

0.27 

(0.11) 

0.18 

(0.07) 

0.30 

(0.10) 

0.26 

(0.09) 

0.18 

(0.06) 


(fa, fa) 

0.24 

(0.15) 

0.14 

(0.08) 

0.09 

(0.04) 

0.31 

(0.11) 

0.26 

(0.09) 

0.18 

(0.06) 


6 Data Applications 

6.1 Application to Country Mortality Data 

The analysis of human mortality is important in assessing the future demographic prospects 
of societies, and quantifying differences between countries with regard to the overall public 
health measure. Functional data analysis approaches have been previously applied to study 
mortality data (Hyndman et al., 2007; Chiou & Midler, 2009; Chen Sz Muller, 2012). To 
study the variational modes of mortality rates across countries over years, we applied the 
proposed LFPCA method to period life tables for 27 countries, with rates of mortality at age 
60 available for each of the calendar years from 1960 to 2006. The data were obtained from 
the Human Mortality Database (downloaded on March 1, 2011), maintained by University of 
California, Berkeley (USA), and Max Planck Institute for Demographic Research (Germany). 
The data is available at www.mortality.org or www.humanmortality.de, with detailed 
description in Wilmoth et al. (2007). 

Let Xi(t) denote the mortality rate in the ith country for subjects at age 60 during calendar 
year t, where 1960 <t< 2006. We directly compute the sample covariance matrix S from 
the observed data and apply the proposed algorithm to solve problem (5). The p\ chosen 
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Figure 4: Top Row: Estimated eigenfunctions for the mortality data, p\ is chosen by 
5-fold cross validation; Bottom Row: Estimated orthogonal basis functions, p 2 is chosen to 
maintain rFVE at 70% in (12), and the number of components k = 3 is chosen to explain 
at least 85% of the total variance. 

by 5-fold cross-validation (9) is always used to ensure a relatively smooth estimate of the 
eigenfunction, and the solution path along different levels of localization is investigated. The 
5-fold cross-validation method gave P 2 = 0, indicating the eigenfunctions are not exactly 
localized. The estimated eigenfunctions without localization penalty are given in the top 
row of Figure 4. We then choose p 2 by the second method as defined in (13) with a = 30%. 
The estimated localized basis functions, as visualized in the bottom row of Figure 4, reveal 
several interesting features. The first localized basis function 0i(i), explaining 54% of 
the total variance, indicates that a big variation of the mortality functions Xj(t) around 
their mean function happens around mid 1990s. The second basis function 02 (t) with a 
mode around 1980s accounts for 21.2% of the total variation. The third basis function 
03 (t) characterizes variation of mortality around 1960s. Although the first localized basis 
function explains less variance than the first leading eigenfunction, only retaining 70% 
of the capability in return of localization, the lost proportion is picked up by the second 
component. The second component could have explained 30.3% of the variance without 
localization. Therefore, we only need three localized eigenfunctions to account for more 
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than 85% of the total variance. 


6.2 Application to Berkeley Growth Data 

The smooth nature of growth curves has been explored in various previous statistical 
analyses, including functional data analysis approaches. We apply the proposed LFPCA 
method to the Berkeley growth data (Tuddenham & Snyder, 1954). These data contain 
height measurements for 54 girls, with 31 measurements taken between ages 1 year and 
18 years. A sample covariance matrix is computed based on equally spaced measurements 
at every half year from interpolated curves and then the proposed algorithm is used to 
solve problem (5). The solution path along different levels of localization is investigated. 
The estimated eigenfunctions without localization penalty are visualized in the top row of 
Figure 5, and the estimated localized eigenfunctions are given in the bottom row of Figure 
5. The localization level is chosen to maintain rFVE = 70% in (12). The total number of 
components k = 2 is chosen to explain total variation of 85%. The first estimated localized 
basis function, explaining 70.1% of the variation, indicates a variational mode in girls’ 
growth around age twelve, which obviously matches the well known pubertal growth spurt. 
The second estimated localized basis function, explaining 18.1% of the variation, is localized 
around ages five and six, which remarkably matches the mid-growth spurt previously studied 
by many researchers (Gasser et al., 1985; Sheehy et al., 1999). The mid-growth spurt is a 
growth phenomenon during early childhood, expressed by a mild transitory acceleration of 
growth velocity between years five and eight. The individual variations in timings, durations 
and intensities of mid-growth spurt are of great interest and some hypotheses have been 
proposed for the explanation of individual differences (Miihl et ah, 1991). This particular 
“mode of variation” is not obvious in standard FPCA. The proposed LFPCA method finds 
a balance between interpretability (localization) and amounts of variance explained. 


7 Discussion 

In this paper, we propose a localized functional principal component analysis through a 
Deflated Fantope Localization method, where sequentially obtained eigenfunctions have 
guaranteed orthogonality and are allowed to be supported on different localized subdomains. 
As mentioned in Section 2, the deflated Fantope V n can easily be generalized to a d- 
dimensional version T>^. In some applications, one might be interested to estimate mutually 
orthogonal principal subspaces with dimensions d±,... ,dk, and each principal subspace is 
only supported on a subdomain Tj C T. 

Throughout the paper, we mainly focus on a dense and regular functional data design where 
p equally spaced observations are recorded on each curve. Most commonly seen functional 
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Figure 5: Top Row: Estimated eigenfunctions for the growth data, p\ is chosen by 5- 
fold cross validation; Bottom Row: Estimated orthogonal basis functions, p 2 is chosen to 
maintain rFVE at 70%, and the number of components k = 2 is chosen to explain at least 
85% of the total variance. 


data have this design and a sample covariance can be easily computed from the discrete 
and possibly noisy observations. Our proposed formula (5) takes the sample covariance 
S and outputs smooth and localized estimates of eigenfunctions. In fact the proposed 
method puts rather minimal requirement on the input matrix S and does not rely on the 
design of observations. Consider the discretized version of T by evaluating on p x p equally 
spaced grid points. The estimation error of the eigenfunctions directly ties to the maximum 
entry-wise error of the input matrix S. Here we briefly discuss several scenarios where a 
sample covariance is not feasible, (i) For dense but irregularly observed functional data, 
one can simply smooth each curve (Ramsay & Silverman, 2005) or interpolate between 
points to get p equally spaced observations, and then a p x p covariance estimate S can 
be computed. This is what we have done for the Berkeley growth data, (ii) For sparse 
functional data where the observations are recorded at random and sparse time points, 
individual smoothing or interpolation is impossible. But a uniformly consistent covariance 
estimation is possible by, for example, two-dimensional smoothing methods (Yao et al., 
2005; Li et al., 2010). Our proposed LFPCA can then be applied by taking S as the 
discretized version of the smooth covariance estimator, (iii) For ultra dense and noisy data, 
the independent measurement errors accumulate if one uses sample covariance computed 
from the raw measurements. Moreover, using a large p x p matrix is not computationally 
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efficient. We recommend performing pre-smoothing or pre-binning on individual curves and 
choosing a grid with a moderate size p. 

We proposed two methods of choosing the localization parameter p 2 . When the cross- 
validation method chooses P 2 = 0 , it roughly means that the true eigenfunctions are not 
localized. Then for a fixed number of a , we find a set of orthogonal basis functions that 
retain the ability to explain a fair amount of variance (at least a (1 — a) proportion) and are 
localized. In this case, the outcome would depend on the choice of a and they should not 
be interpreted as the true eigenfunctions. Rather, these localized basis functions and the 
corresponding projection scores have ready interpretation with domain knowledge. 


8 Appendix 

Proof of Lemma 1. Because V n is a compact set, we know that B = Vt> [[ (A) exists and 
is unique. Let G = U T BU , then we have G £ J rl and B = UGU T . Note that G minimizes 
11A — UGU t \\ 2 f over and 

||A - UGU t \\% =\\A\\ 2 f - 2 (A, UGU t ) + \\UGU t \\ 2 f 
=\\A\\ 2 f -2(U t AU,G) + \\G\\ 2 f 
= \\A\\ 2 f - \\U t AU\\ 2 f + | \U t AU - G\\ 2 f . 

Therefore, G is the projection of U T AU onto J- 1 and by Lemma 4.1 of Vu et al. (2013) we 
have 

p—d 

G = 7 * + wrf 

i =1 

with 7 j, r/j, and 9 specified in the theorem. ■ 

Proof of Theorem 2. For any given r > 0, define the augmented Lagrangian of ( 8 ) as 
Lr(H, Z, Y ) = l Vn (H) - (S , H) + P2PII71 + (Y, H - Z) + T - \\H - zf F . 

The update steps in Algorithm 1 now reads, letting W^ = rhW, 

H M = argmin L T (H, Z^ r ~ l \W {r ~ l) ) , 

H 

zW = argmin L T (H^ r \Z, W (r) ), 
z 

Y (r) = Y( r -V + t(H — Z) . 
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It is obvious that Ix> (H) — (S,H) and p 2 ||Z||i 1 are closed, proper, and convex functions. 
Here we say a function / is closed, proper and convex if {(x,i) : f{x) < t} is a closed 
non-empty convex set (Boyd et al. (2011), Section 3.2). 

By strong duality, we can find a primal-dual pair of Lq(H, Z, Y), denoted as (H**, Z**,Y**). 
It then follows from the primal and dual optimality that (H**,Z**,Y**) is a saddle point 
of Lq and hence by Section 3.2.1 of Boyd et al. (2011), we have 


Z (r) Z* and H (r) - Z (r) -> 0, as f 
where (H *, Z*) is an optimal primal variable for Lq. 


oo , 


Proof of consistency result. To prove Theorem 3, we need some additional lemmas and 
notation as follows. The proof of lemmas are given after the proof of Theorem 3. 

Let Ij = ((j - 1 )/p,j/p\ for j = 2and I\ = [0,1/p]. We define (j)*At) = for 


t e Ii, u* = p 1 / 2 (^(ti),^(t 2 ), and Uj = 


Ua 


V 1'2- 


Let T : [0, l] 2 i->- [0,oo) 


be such that T(s,t) = T(fj, tj), if s € Ij,t € Ij . Define the discretized and diagonal-shifted 
covariance matrix E by Y(l, l') = F(ti,ti’) + o,l(l = l'). Let cf)j be eigenfunctions of T and Vj 
be eigenvectors of E. Then p ^ 1 / 2 </ J (t) is the ith. entry of Vj if t € Ij. If we further denote 
the jth eigenvalue of T by A j, then (p\j + a, Vj) is an eigenvalue-eigenvector pair of E. 

Let = Yji=i v i v I ', an 0 Ej = E — n^iEn^i. Define Ho = 0 for convenience. For any 
measurable B : [0, l] 2 i —> M, let ||H|| hs = [Jr 0 ^2 B(s, t^dsdt] 1 ^ 2 be the Hilbert-Schmidt 
norm. 

Lemma 5. Under Assumptions (A1-A4), let co = L(2/5 + 1) we have for p large enough, 
we have \\vj — Uj || 2 < cop -1 , for 1 < j < k. 

Lemma 6. Under Assumptions (A1-A3), when p is large enough we have ||E||^ < c|p 2 , 
where c 2 = a 2 + 4||T||g S + L 2 . Moreover, the gap between the jth and (j + l)th eigenvalues 
of E is at least pS/2 for all 1 < j < k. 

The following lemma is elementary and can be found in Vu & Lei (2013). 

Lemma 7. Let u and v be vectors of same length with unit norm. Then 

—=\\u — v\\ 2 < || uu T — vv t \\ f < \/2||w — u|' 
v 2 

The same holds when u, v are functions and ||-||j, is replaced by 
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IHS- 


The next lemma characterizes Uj as an approximate leading eigenvector of Ej. It extends 
Lemma 4.2 of Vu & Lei (2013). 
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Lemma 8 (Approximate curvature). Let Hj be a solution to (6). Then under Assumptions 
(A2-A4), for p large enough, 


p5 

~8~ 


I Hj-ujuJ 


i2 

If 


3coC2 

2 


<(—Ej, Hj — UjuJ) , V 1 < j < k . 


Proof of Theorem 3. The claim follows if we show that sup^^fcllOj — Uj || 2 = o p ( 1). 

For simplicity denote e n := \\S — E^^, which is o p (l) by assumption (Al). Let IIo = 

n 0 = 0, vo = 0, A) = eo = 0. We use induction to show that there exist (e,, A : 0 < i < k) 
such that 


sup €i = o p ( 1), sup Pi = o p (p ), (15) 

0 <i<k 0 <i<k 


max{||A - Uj|| 2 1 ||fii - n*|| 2 } < e* (16) 

Pi(D,Vivf) < Pi. (17) 


Obviously the claim holds for k = 0. Now assume that the claim holds for k = j — 1, and 
j > 1. We will construct €j = o p ( 1) and Pj = o v (jp) satisfying (16) and (17). 

Let Ej = E — nj_iEIIj_i for j = 1, k. By Lemma 8 we have, for p large enough, 


p5 

~8 


\Hj-Uji 


\2 

If 


3cqC2 


— ( ^ j ) Hj 



(18) 


Next we need to control (E — Ej, Hj — UjuJ) and (S, Hj — UjuJ), where the first one is 
small because Hj and UjuJ are nearly orthogonal to E — E j, and the second term needs to 
be controlled by the fact that Hj is a maximizer of (6). 

For the first term (E — S j, Hj — UjuJ ), by the orthogonality constraint, we have 

(E — Ej, Hj) <Ai(II f_i, Hj) = X] |(rij-i — nj i, Hj)\ < Aiej_i <C2pej-\. 

and similarly 

(E — Y,j,UjuJ) =(E — E j,UjuJ — VjvJ) < \\Y,j-i\\ F \\ujuJ — VjvJ\\ F < V2 c2Cq , 

where the last inequality follows from Lemma 5 and Lemma 7, and therefore 

| (E - E j,Hj - UjuJ) | < c 2 p(ej- 1 + \f2cop~ 1 ). (19) 

Now we turn to the term (S, Hj — UjuJ). If we can show that 

0 <(S, Hj — UjuJ) — pi{D, Hj) + Rj , (20) 
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for some Rj = o p (p) then we have, combining (18) to (20), 


p5. 


Hj — UjUjWp <(S — S, Hj — UjuJ) — pi(D, Hj) + R) 
<e n p\\Hj - UjuJ\\ F - pi(D, Hj) + Rj , 


(21) 


where R’- = C 2 p((-j-\ + \/2cop x ) + Rj + 3cqC 2/2 . It follows that 


8e r 

HF - 


I ZT T II / . 

I H j — UjUj || tp < — -1- 


\8R', 

5p 


( 22 ) 


Since VjvJ is the closest rank one, unit norm matrix to Hj , we have 


\Vjvf - VjvJ\\ F < ||VjvJ - UjUjWp + ||UjuJ - VjvJ\\ F 


<2 ||Hj — UjUj || F + V2c 0 p 1 < ^21 + 2\ 


I8R) 

~Sp 


+ y/2 c 0 p 1 


and 


n,- n,-|| F < ||n,-_i nj_i|| F + \\vjv~ vjv 


J u j u J u j II F 


16e n 

<ej-i H ~ I- 2 \ 


i8R• 
~5p 


+ V2cop 1 =: ej 


Now it remains to find f3j. Using (21) we have 

pi(D, Hj) <e n p\\Hj - UjuJ\\p + R 'j < 2 e n pej + Rj . 

On the other hand, let A j5 i be the largest eigenvalue of Hj. Then 
e 2 

> ||Hj - UjuJW 2 p = ||-ffj-|| F - 2 (Hj, UjuJ) + 1 > X 2 j tl - 2Xjp + 1, 

where we use the fact that ||iT/|| F > X 2 - L , and (Hj,ujuj) < Xjp 11tx^ 11^ (von Neumann trace 
inequality). It then follows that A ? i > 1 — ej/2, which implies that 

Pi(D,VjvJ) <(1 - e j /2)~ 1 p 1 (D,Hj) < (1 - tj/ 2) _1 (2 e n pej + Rj) =: f3j . 

Direct verification shows that if maxo<j<j_i e* = o p ( 1), maxo<i<j-i /3j = o p (p), and Rj = 
o p (p), then €j = o p (l) and (3j = o p {jp). 

The rest of the proof is to show (20) for some Rj = o p {p). The main challenge is that Uj 
is not in the feasible set of problem (6) and hence UjuJ is not directly comparable to Hj 
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using optimality condition of (6). To overcome this difficulty, we consider uj , a modified 
version of Uj so that (a) Uj is close to Uj in £2 norm; (b) UjuJ is feasible for (6); (c) Uj is 
almost as smooth as Uj. 

Dehne Uj = (I — Uj-i)uj/\\(I — IL ; _i)wj|| . We hrst check the validity of this definition. 

I IIj —1 Uj 11 2 —IKnj—i n j— 1 ) ^j 11 2 T Illlj-iWjlla T i(uj ^ 3 ) 11 2 — L / - 1 4~ °0P 


When ej -1 is small and p large, (I — Hj-i)uj / 0, and 

(/— Uj 


||(I - n j-l)Uj|| 2 IKII2 


< 2|| (I - Hj-i)uj - Uj L < 2(e j - 1 + c 0 p x ), 


(23) 


where the last inequality holds when p is large and the first inequality follows from an 
elementary fact that, for all u , v, 


u 


\u\ 


< 2 - 


\u — v\ 


max \\u o, u 


2’ 11^1121 


Now we establish (20). By feasibility of UjuJ we have 

0 <(S, Hj - UjUj ) - p^D.Hj) + pi(D,UjuJ) - p 2 {\\H j \\ 11 - \\uju] || 1A ) 

<{S,Hj - UjuJ) - pi(D, Hj) + pi{D, UjuJ) + P 2 P + | (S,UjuJ - UjuJ) \ . (24) 

We hrst bound \(S, UjuJ — UjuJ)\: 

| (S,UjuJ - UjuJ) | <\\S\\ F \\ujuJ - Uj Uj || F < (||S|| F + US' - T,\\ F )\\v,juJ - UjuJ\\ F 

<2\/2 (c 2 + e n ) p(ej + c 0 p _1 ), 

where the last step uses (23), Lemma 6, and the fact that US' — ^ = e n . 

Now we control pi(D,UjuJ). When ej_i and p l are small enough such that ||(/ — 
n i-i)uj|| 2 > 1/V2, we have 

Pi(D, UjiiJ) = p\ || Ahj 11 2 < 2pi||A(/-n i _i)u i ||2 



/ ^ \ 2 ‘ 


3 - 1 

<4pi 

2 + ( ^||A'0i|| 2 |(i)i,u i )|J 

VI 

2piL 2 p~ A + ^2 1 + cop” 1 ) 2 

i =1 
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where the first two inequalities follow from multiple applications of Cauchy-Schwartz, 
and the last inequality holds by definition of /%, the smoothness of Uj, and the fact that 
YiZ i \(vi,Uj)\ 2 = 11 || 2 • As a consequence, we establish (20) from (24) with 


Rj =2\/2 (c 2 + e n ) (ej-ip + c 0 ) 4- p 2 p + 8 


j- 1 

PiL 2 p~ 4 + ^2 Pi{ej- i + c 0 p -1 ) 2 
i=l 


Op(p). ■ 


Proof of Theorem 4. Prom assumption A1 and Lemma 5 it suffices to prove that if 
W S - s lloo,oo = °( e n) then sup 1 < i < A .||'O i - Vj \\ 2 = 0{e n + p 2 ). 

Consider the estimation procedure given by (5) for j = 1. Let Bi be the collection of all 
p x p symmetric matrices with entries in [—1,1]. The optimization problem can be written 
in the following equivalent form. 


max min (S , H) — p 2 (Z, H). 
HGT'Ze Bi' 


Let H* be any maximizer, then 

H* = arg max (S — p 2 Z* ,H) = arg max (E + W — p 2 Z*, H) 
HgF 1 HgJ 71 


where W = S — £ and Z* is the corresponding optimal dual variable. 

By Lemma 5 the eigengap of E is of order at least p while the operator norm of W—p 2 Z* is at 
most p(|| Wllgjj 00 + p- 2 ) which is o{p). Thus applying standard spectral subspace perturbation 
theory we know that H* = v\vj where v\ is the leading eigenvector of E + W — p 2 Z*, and 
satisfies for some constant c 


\\vi - vi\\ 2 < c\\W - p 2 Z*\\ F /p < c(e n + p 2 ). 

For j = 2we use induction. Suppose that for j — 1 we have \\vj-i — Uj_i|| 2 and 
llhiy-i — nj_i|| F are bounded by 0(e n + p- 2 ). 

Now consider the procedure (5) for j. Similarly let H* be any solution and Z* the 
corresponding optimal dual variable. We have 

H* = arg max (S — p 2 Z* , H) 

Hev. 

n j-i 

= arg max ((/ - flj_i)(5 - p 2 Z*){I - n^i),#). 

The remainder of the proof focuses on analyzing the matrix ( I — IIj_i)(S' — p 2 Z*)(I — IL,_i). 
In particular, we show that its leading eigenvector is close to Vj with the desired rate. We 
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first write this matrix in four terms 


(/ - n,■_!)(£ + w - P2 z*)(i - 4_,) 

=(/ - n ,_!)£(/ - n.-O 

+ (iij-i — f[j_i)£(/ — rij-i) + (/ — nj_i)E(nj_i — 4-1) 

+ (/ - 4-1 )(W + p 2 Z*)(I - 4 _!) 

=T 0 + T 1 +T 2 . 


The main term is To- The leading eigenvector of Tq is Vj with an eigengap at least dp /2 
according to Lemma 6. Next we bound T\ and T 2 . In fact we have 

II^iIIf — 2 ll n j-i — 4-iIIfI|s|| 2 < 2 c 2 ||n i _i — 4 -iIIfp> ( 2 ^) 

where c 2 is the constant in Lemma 6 and 

\\T 2 \\ F <\\W + p 2 Z*\\ F <(e n + p 2 )p. 

Then we have 

II Ti + T 2 \\ f < 2c 2 ||n i _i - n i _i|| F p+ (e n + p 2 )p. (26) 

When n and p are large enough, ||Ti + T 2 \\ F is smaller than the gap between the first and 
second largest eigenvalues of To. Therefore, the induction completes by using Davis-Kahan 
sin© theorem (Bhatia, 1997, Theorem VII.3.1) 

II VjiiJ - VjvJ\\ F < 2||Ti + T 2 \\ F /(5p/2) < 8c 2 <5 _1 ||II,-_i - 4 -iII.f + 45 _1 (e n + p 2 ), (27) 

where c 2 is the constant given in Lemma 6. ■ 

Proof of Technical Lemmas 

Proof of Lemma 5. Note that T is a compact self-adjoint operator from T 2 (0,1) to 
T 2 (0,1) with eigen-decomposition T(s,t) = J2j=i A j</)j(s)(j)j(t). 

The Lipschitz condition on T implies that 

l|f - r||| s := J J l r p(M) - T(s,t)\ 2 dsdt < ^ ■ ( 28 ) 

By Weyl’s inequality, | Ay — Ay | < 5/2 for large p. Let Ej and Ej be the projection operators 
onto the one-dimensional subspaces spanned by (f)j and ([>j. respectively. Then 

Hi - 4>ih < C2II Ej - FIIhs < 4|F 7 IIhS < ^ . (29) 
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where the first inequality follows from Lemma 7, and the second inequality follows from the 
Davis-Kahan sin© theorem (Chapter VII of Bhatia (1997)). 

On the other hand, by assumption (A4) we have 


hh ^ Wfij ~ foWoo ^ Y V 


(30) 


which, together with (29), implies that 


(| + |) f 


Also note that 

H-“jll2=lHll2- 1 | ^ \\<t>*j ~ < Wj - <AjIIoo ^ 7^- 

Combining the previous two inequalities, we have 


u 3 ~ M2 ^ l $ t ± j ~ c o P 


I Vn ~ Uj\\ 0 < ( - + 1 


-1 


Proof of Lemma 6. The first claim follows from, letting E* be the discretized T evaluated 
at the grid, 

\m\ 2 F <2\\aI p \\ 2 F + 2\\'£*\\ 2 F = 2a 2 p + 2p 2 ||f||| s 

<2a~p + 2 p~ (2||F||h S + 2||f — 1 ||hs) — 2a 2 p + 2 p 2 ^2||r||^ s + ^ 

<p 2 (2 a 2 p~ l + 4||r||| s + L 2 ) < c 2 2 p 2 , 

where (28) is used to bound ||T — F11 HS - 

The second claim follows from the fact that the eigengaps of E are the same as those of E*, 
and by Weyl’s inequality: 

A; — Aj+i >A j — Aj+i — 211r — T|| H g > 5 — — = 5/2 . ■ 

Proof of Lemma 8. Note that Vj is the leading eigenvector of S j, with eigengap at least 
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pS /2 as implied by Lemma 6. Then we have 


Hj — UjUj H^, <21| Hj — Vjvj \\p + 2 ||VjVj — UjUj \\ Z F < E, Hj — VjVj) + 4 c^p 


T || 2 


T||2 


„2 -2 


J J J J J - p8' 


~ + ~ v 3 v J\\f + 4c oP 2 




7 1 . 8\/2coC2 2 —2 




+ 4c 0 p“ 


^ 8 / V U T\ | 12c 0 c 2 


where the first and third inequalities come from Cauchy-Schwartz, the second from the 
curvature lemma of principal subspace (Vu &; Lei (2013), Lemma 4.2), and the last holds 
provided that p is sufficiently large. ■ 
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